Definition of the effector landscape across 13 phytoplasma proteomes with LEAPH and EffectorComb

Abstract ‘Candidatus Phytoplasma’ genus, a group of fastidious phloem-restricted bacteria, can infect a wide variety of both ornamental and agro-economically important plants. Phytoplasmas secrete effector proteins responsible for the symptoms associated with the disease. Identifying and characterizing these proteins is of prime importance for expanding our knowledge of the molecular bases of the disease. We faced the challenge of identifying phytoplasma's effectors by developing LEAPH, a machine learning ensemble predictor composed of four models. LEAPH was trained on 479 proteins from 53 phytoplasma species, described by 30 features. LEAPH achieved 97.49% accuracy, 95.26% precision and 98.37% recall, ensuring a low false-positive rate and outperforming available state-of-the-art methods. The application of LEAPH to 13 phytoplasma proteomes yields a comprehensive landscape of 2089 putative pathogenicity proteins. We identified three classes according to different secretion models: ‘classical’, ‘classical-like’ and ‘non-classical’. Importantly, LEAPH identified 15 out of 17 known experimentally validated effectors belonging to the three classes. Furthermore, to help the selection of novel candidates for biological validation, we applied the Self-Organizing Maps algorithm and developed a Shiny app called EffectorComb. LEAPH and the EffectorComb app can be used to boost the characterization of putative effectors at both computational and experimental levels, and can be employed in other phytopathological models.


Introduction
Plants live in a constantly changing environment and have developed high phenotypic plasticity including responses and adaptations to environmental factors ( 1 ,2 ).In the case of pathogen infection, interactions are based on a molecular dialog between the pathogen and its host.Plants and pathogens are engaged in an arms race where plants have evolved stratagems to detect the bio-aggressor while the latter aims to bypass the host's immune system (3)(4)(5)(6).Plants modulate their immune response, cell signaling, metabolism and even development upon pathogen attack by acting on regulation of its transcriptome (7)(8)(9).This modulation of gene expression relies on the secretion of specific pathogenicity factors called effector proteins.The effectors' arsenal and their delivery system vary from species to species, although they have the common aim of interfering with the host's metabolism to their advantage, and to hamper the immune system, enhancing pathogen survival and, as a consequence, the development of the disease (10)(11)(12)(13)(14).
The increasing availability of genome sequences of plant pathogens has allowed dramatic advances in the characterization of pathogenicity mechanisms and the development of tools to improve effector prediction.For instance, EffectorO is a machine learning model exclusively trained on the Nterminus of oomycete effector proteins ( 15 ).Another example is EffectorP1.0 to 3.0, an ensemble of machine learning mod-els to predict both effectors and their localization (apoplastic or cytoplasmic) trained on both fungi and oomycetes to search for characteristic enrichment of amino acids and their properties (16)(17)(18).Deepredeff is a convolutional neural network trained on sequences of bacteria (Gram positive and Gram negative), fungi and oomycetes ( 19 ).DeepT3 combines different deep-learning algorithms to predict effectors secreted by the bacterial type III secretion systems (T3SSs) in Gramnegative bacteria ( 20 ).Finally, EffectiveT3 is a machine learning model to predict whether effector proteins are secreted by the T3SS ( 21 ).Despite these tools allowing advances in knowledge of effectors for some plant pathogens, there is still a lack of prediction methods for several pathogens, hampering a global characterization of pathogenic mechanisms.
A clear example of this category is the ' Candidatus Phytoplasma' genus, a group of plant-pathogenic, phloemrestricted, bacteria assigned to the Mollicutes class, cell wallless and pleomorphic bacteria, ranging from 0.2 to 0.8 μm in size (22)(23)(24)(25).Phytoplasmas are associated with diseases in a large number of crops, ornamental plants and trees that result in altered development and huge yield losses in plants (26)(27)(28)(29).Symptoms of the disease include, among the most common, witches' broom agglomerates of young branches, increased proliferation of shoots, yellowing of the leaves, virescence (flower organs become green), phyllody (flowers develop into leaf-like flowers), dwarfism, reduction in size and tasteless 2 NAR Genomics and Bioinformatics , 2024, Vol. 6, No. 3 fruits ( 24 , 30 , 31 ).The transmission from plant to plant occurs by insects, which become vectors of the pathogen once they acquire the phytoplasma during phloem feeding (32)(33)(34).The pathogen behavior of phytoplasmas and the difficult challenge of cultivating them under in vitro conditions have hindered experimental studies focusing on the identification of effector proteins.Similarly, because of their high amino acid sequence variability, the in silico identification of effector proteins exclusively based on the overall sequence similarities is inefficient ( 14 ).As a consequence, until a few years ago, only certain types of effector proteins had been identified in phytoplasmas ( 35 ).The main feature of those candidate effectors was the presence of signal peptides, restricting the diversity of effectors to only a few classes.The possible role in disease development of the following classes of effectors was mainly studied by expressing them in transgenic plants.TENGU, a small secreted peptide encoded by onion yellow phytoplasma, causes dwarfism and altered flower structures ( 36 ).SAP05 which interferes with plant vegetative growth is a secreted protein found from ' Ca .P. asteris' ( 37 ,38 ).SAP11-like proteins, first identified from aster yellow witches' broom phytoplasma, cause abnormal proliferation of young shoots and changes in leaf shape ( 37 ,39 ).SAP54 and PHYL1 are two homologous effectors that belong to the phyllogen family and cause flower malformations including phyllody, virescence and proliferation ( 37 , 40 , 41 ).
Nowadays, standard methods for effector identification in phytoplasmas are only based on the presence of a signal peptide, fairly reconstructing the phytoplasmas' secretome.However, it is important to consider that not all the secreted proteins are effectors and that some effector proteins are secreted by non-classical pathways ( 42 ,43 ).Moreover, it is shown that in Gram-positive bacteria, ancestors of phytoplasmas, the hydrophobic region of signal peptides is longer than usual, making them more similar to transmembrane regions and undetectable by software specifically designed for signal peptide prediction ( 44 ).Thus, recent methods rely on the combined predictions by software designed for both signal peptide and transmembrane domain detection ( 44 ,45 ), leading to a large number of candidates to be tested without any prioritization in assessment.Despite the recent attempt of Carreón-Anguiano et al .( 46 ) to build up an ad hoc pipeline suitable for phytoplasma effectors, there is still an urgent need for a tailored and reliable method to characterize, predict and prioritize phytoplasma effectors.
To address these needs, we developed LEAPH (ensemb L e model for E ffector cl A ssification in PH ytoplasmas) a computational method composed of an ensemble of four supervised learning models to capture distinct sets of features and efficiently predict putative effector proteins in phytoplasmas.Indeed, LEAPH predicts effectors with ∼97% accuracy, outperforming existing prediction methodologies tailored for both phytoplasmas and other pathogens.The ensemble model was trained on 479 proteins coming from > 50 ' Ca .Phytoplasma' species and then applied, as a use-case scenario, on 13 proteomes (ranging from 327 to 730 proteins), allowing us to identify a comprehensive landscape of putative candidate effectors.We used the Self-Organizing Map (SOM) algorithm to describe the properties of this landscape, which combines clustering and dimensionality reduction to embed the protein sequence similarities based on the feature profiles.Therefore, neighboring points on the map represent proteins sharing similar features, providing the first reference map for phyto-plasma effectors.We developed a user-friendly Shiny application, called EffectorComb, to investigate the effector protein map.Overall, LEAPH and EffectorComb offer the possibility to predict, interpret and explore the resulting protein candidates, thus boosting the experimental validation process from the very beginning.

Training datasets
For sequence selection and curation of the training dataset, we used the UniProt database (release 2022_01) ( 47 ), selecting proteins uniquely belonging to the phytoplasmas TAXID 33926.

Positive dataset
The positive dataset is created using two filters.Firstly the field 'Protein Names' is searched using the AND operator and each of the following regular expressions: '*Effector*', '*TENGU*', '*SAP54*', '*SAP11*', '*SAP05*' and '*PHYL1*'.Secondly, the resulting list is pruned from records that have the terms 'putative' or 'fragment' in the 'Protein names' field.The result is a set of 174 proteins from 53 phytoplasma species (Figure 1 S1.1 ).

Negative dataset
The negative dataset is constructed by manual inspection of UniProt database entries having evidence for revision in the 'Protein existence' field AND having a function not related to the known effectors' activity (e.g.no '*SVM*' or '*trigger factor*' in 'Protein Name' field).The remaining list is checked for the absence of a cross-reference with the PHI-base database ( 49 ,50 ), meaning that there is no known interaction between these proteins and those of the host.The negative dataset is finally composed of 295 non-effector proteins from 12 phytoplasma species ( Supplementary Table S1.2 ; Figure 1 A).

LEAPH method description
The LEAPH workflow consists of four main steps described hereafter and shown in Figure 1 B. Briefly, in step 1, LEAPH takes as input the amino acid sequences of interest in FASTA format and computes the 30 features, described in the following paragraph, yielding as output the feature table.During the training process (step 2), the feature table is used as input for four classifier models to be trained.After a 5-fold crossvalidation process, each of the best models is used to finally assign a class probability to every protein in the overall training dataset (step 3).LEAPH predicts a protein to be putatively pathogenic if at least one of the four models gives a class probability > 90% and thereby assigns a score ranging from 1 to 4 to each protein based on the models' agreement (step 4).The pre-trained model can be used to predict putative pathogenicity proteins in any set of protein sequences, producing an output table that contains the resulting classification along with model agreement, mean prediction probability and the corresponding amino acid sequence for each protein.

Feature calculation
The first step of LEAPH is the annotation of protein sequences and the extraction of features to describe protein sequence properties.We included a total of 30 features: (i) sequence length; (ii) signal peptide presence [using the D-score of Sig-nalP 4.1 software ( 53 ) configured as in the work of Garcion et al .( 44 )]; and (iii) transmembrane (TM) domain presence [TMHMM2.0 ( 54 )].Specifically, we included different aspects of the prediction of TM regions, namely the number of predicted TMs, (iv) the expected number of amino acids in TMs, (v) the number of expected amino acids in the first 60 positions of TM helices, (vi) the probability that the Nterminus of the protein is in the cytoplasm and (vii) a warning for possible misprediction of the TM regions [when feature (vi) is > 10], representing a potential signal peptide region [mispredicted transmembrane region (mTMR)].Since effector proteins can be composed of intrinsically disordered regions (IDRs) ( 55 ), we also predicted (viii) the eventual presence and the length of IDRs [MobiDB-Lite1.0 ( 56 ,57 )].Finally, we added 22 features concerning putative characteristic motifs of effectors: (ix-xx) by using Prosite1.86( 58 ); we calculated the occurrences of functional protein motifs belonging to the positive dataset, thus each of the 22 features is the sum of the respective predicted motif in a protein sequence; (xxi-xxx) by using MOnSTER ( 59 ), we obtained cluster of Prosite motifs (CLUMPs) considering their physicochemical characteristics.From MOnSTER application, we extracted six features representing the occurrence of each selected CLUMP and four additional features, namely the occurrence of CLUMPs in four consecutive bins of 25% of the protein sequence.The configuration parameters for all tools used to obtain the 30 features and the main descriptions are indicated in Supplementary Table S3 .

Cross-validation and h yperpar ameter tuning
Because of the size and the unbalanced distribution of proteins between classes in the training set, we applied the stratified 5-fold cross-validation method (StratifiedKFold from scikitlearn 1.1.1).Each model is trained and tested five times on a different subset of the training data respecting class proportion (80% of data for the training set and the remaining 20% for non-overlapping test sets).
We performed a hyperparameter tuning via GridSearchCV from scikit-learn 1.1.1,for each fold, and for each classifier except for Gaussian naive Bayes which was trained with the default parameters.With the grid search, the parameters are tested from a specified set of values.The starting parameters used in the training process, which differ from the default ones, are included in Supplementary Table S4 .Altogether cross-validation and hyperparameter tuning allow selection of the best model for each classifier algorithm to be used to predict new putative pathogenicity proteins.
To support the reproducibility of the machine learning method of this study, the machine learning summary table ( Supplementary Table S4 ) is included in the supporting information as per DOME recommendations ( 65 ).

Classification method
The best model for each classifier obtained from the previous phase is used to finally assign a class probability to every protein in the overall training dataset: if for at least one model the class probability is ≥ 90%, then the protein is considered as putatively pathogenic.LEAPH assigns a score ranging from 1 to 4 to each protein based on the models' agreement.The pre-trained model can be used to predict novel putative pathogenicity proteins in any set of proteins, yielding an output table containing the classification prediction, the model agreement, the mean prediction probability and the corresponding amino acid sequence for each protein.

Performance measures
To assess the performance of the prediction models used in this study, we used the following metrics: accuracy, precision, recall and F 1 score defined as:

Precision = T P T P + F P
(2) Where TP, TN, FP and FN refer to true positives, true negatives, false positives and false negatives, respectively.These measures are calculated using the sklearn.metricslibrary.

Benchmark of effector prediction tools
The state-of-the-art method commonly used in prediction of phytoplasma effectors consists of a combination of two predictors: SignalP4.1 and TMHMM2.0.Specifically, a protein is predicted as putatively pathogenic if both a signal peptide and an mTMR are predicted.This method is included in the benchmark as the reference method for phytoplasma effectors.Recently, Carreón-Anguiano et al .published the first pipeline that identifies effectors in phytoplasmas, PhyEffector ( 46 ).This pipeline adds to the previous two tools, also Phobius ( 66 ), BLASTP ( 67 ) and SecretomeP2 ( 68 ), thus we included it in the benchmarking.We also compared the performance of LEAPH with other effector prediction tools on the training set.We considered EffectorP3.0 ( 18 ), which is mainly suitable for fungi and oomycete putative pathogenicity proteins, Ef-fectorO ( 15 ), tested for oomycete putative effector prediction, and Deepredeff ( 19 ), used for putative pathogenicity protein prediction in bacteria, fungi and oomycetes.All the tools are used with default parameters or, where possible, in a configuration suitable for bacteria.

Feature importance calculation
We assessed the feature importance for each classification algorithm using the SHapely Additive exPlanations (SHAP) approach ( 69 ).SHAP is widely used for explainable machine learning and gives a more comparable spectrum of feature importance across different models.It measures the contribution of each feature to the final output using game theory concepts and feature permutations, assigning a SHAP value to each feature.Compared with the classical feature importance measurements, this additive feature method relies on maintaining the accuracy of the model, dealing with missing features and consistency in model changing for the same data.

SecretomeP2.0
To predict non-classically secreted proteins, we used Secre-tomeP2.0( 68 ).Since the Gram-positive bacteria model is currently unavailable for the latest version of the tool, we used the Gram-negative bacteria model.We split the sequences into 100 sequences per fasta file [using seqkit split ( 70 ) and setting the parameter '-s 100 ] to use the web server, and we crosschecked each prediction with the presence of a signal peptide predicted by both SignalP4.1 and TMHMM2.0, as per Secre-tomeP2.0web service recommendations.

Exploratory and functional analysis
We performed a principal component analysis (PCA) on scaled values (MinMaxScaler from scikit-learn 1.1.1) of the 30 protein features described in the section 'Feature calculation ' , and InterProScan5 ( 71 ) to predict protein domains associated with a known biological function on the predicted putative pathogenicity proteins identified by LEAPH.We then explored the differences in protein domains that occur in at least 1% of the sequences, according to the secretion modality of the predicted proteins.

Effector protein map
We used SOM ( 72 ) to exploit the properties of the putative pathogenicity proteins identified by LEAPH from 13 phytoplasma proteomes .To build up the SOM, we firstly scaled the values of each feature (MinMaxScaler from scikit-learn 1.1.1),excluding mTMR as a categorical, thus not suitable, feature and then used the aweSOM R package, which outputs a dynamic map.Through SOM visualization, points on the map in the same hexagonal cell share a similar feature vector and thus are indeed similar to each other.To find the optimal number of hexagonal cells to create the map, we tried different sizes of lattice grids from 8 × 8 up to 11 × 11, where we stopped because the 10 × 10 achieved the best balance between the errors (Quantization, Topographic and Kaski-Lagus) minimization and variance explained (see Supplementary Table S5 for further details).

EffectorComb app
We developed an interactive Shiny app to investigate the obtained SOM.The app allows the user to retrieve a set of pathogenicity proteins predicted by LEAPH with properties of interest along with their sequence identifier and amino acid composition.It is also possible to download SOM interactive image .htmlfiles.EffectorComb can be also used to project new proteins, predicted by LEAPH application, on the preobtained SOM for a deep result exploration and comparison.

A comprehensive collection of phytoplasma effectors and features to describe their sequence characteristics
To fulfill the need for a method tailored for predictions of phytoplasma effectors, we developed LEAPH, a machine learning ensemble model.Careful selection of the training set and features to use for classification purposes are crucial in machine learning applications.Therefore, by performing extensive literature and database mining, we collected 184 protein sequences from 53 ' Ca.Phytoplasma' species that make up the 'positive dataset'.The 'negative dataset' is composed of 295 proteins whose function is not related to the activity of the known effector and / or no interactions with host plant proteins are reported (Figure 1 A; Supplementary Tables S1.1 and S1.2 ).Intending to build a classifier for the prediction of novel effectors, we calculated 30 features to represent the salient characteristics of their sequences ( Supplementary Table S3 ).We included eight features that describe the protein sequence properties and the mode of secretion, 12 features summarizing the presence of protein domains important for plant invasion and infection and 10 features relating to the presence of characteristic sequence motifs found by using MOn-STER ( 59 ).To inspect whether these features are discriminant of putative effector properties, we plotted the distribution of each, for the positive and negative datasets, respectively ( Supplementary Figure S1 ).Twenty-four features show a significantly different distribution (Mann-Whitney test, Pvalue < 0.05) by comparing the two datasets.However, we decided to include all features because non-significantly different features can be specific to a small class of putative effec- ( B ) LEAPH w orkflo w.T he dark green bo x highlights the training process while the light green bo x represents the application process of LEAPH; the dashed box depicts the particular use case in which a feature table is already available, thus ready to be used as input for the pre-trained LEAPH model.tors, too small to reach statistical significance when including all the effector proteins together.
Overall, both the datasets and the features are suitable for setting up a classifier based on machine learning models.

The four classifiers included in LEAPH captured distinct features to predict pathogenicity proteins
Because of the wide variety of hosts and infection symptoms, we expect phytoplasmas' effectors to have different characteristics, thus the hypothesis is that different learning models would be able to capture diverse properties, yielding a more comprehensive prediction.Therefore, we used an ensemble learning composed of two tree-based algorithms, namely random forest and XGBoost ( 60 ,61 ), and two naive Bayes classifiers, i.e. a Gaussian and a multinomial model ( 62 ,63 ).We fed the four classifiers with our training dataset and measured their performances on the test set ( Supplementary Table S4 ).Overall, the four models showed very good and comparable performance both within the cross-validation folds and between methods, ranging from 95% to 99% for the four measures used: precision, recall, accuracy and F 1 score ( Supplementary Table S6 ; Supplementary Figure S2 ).The best performing model was selected for each classifier method and included in the ensemble approach that we call LEAPH (ensemb L e model for E ffector cl A ssification in PH ytoplasmas).To each candidate pathogenicity protein predicted with LEAPH, a score ranging from one to four is associated, corresponding to the number of classifiers that agree on the outcome.
In Figure 2 A, we report the Venn diagram showing the agreement of the prediction obtained by the four models.Most candidate effectors were predicted by the four models (96%), as expected.Interestingly, the two naive Bayes classifiers, and mainly the multinomial model, were able to identify putative effectors that were not identified by the two treebased models, supporting our hypothesis that different classifiers might capture different patterns of candidate pathogenicity proteins.To further investigate this direction, we calculated the feature importance for each classifier by using the SHAP  algorithm ( 69 ) which is a game theory approach to explain the output of any machine learning model.Indeed, as shown in Figure 2 B and Supplementary Figure S3 , we observe a quite different spectrum of features, with the highest contribution scores depending on the classifier.While the feature describing the first 60 amino acids of the protein sequence is found among the top two for all methods, the other features seem to contribute differently, depending on the classifier.In particular, we observed that only for the multinomial naive Bayes, the sequence length has a high impact on the model output.On the other hand, for the Gaussian naive Bayes model, the mTMRs make the greatest contribution.Moreover, the top 10 rank of the contributing features for all the models include the presence of three functional motifs related to protein modification sites: myristoylation, glycosylation and phosphorylation sites.This means that the occurrence of these motifs in the protein is a discriminative predictive sign for each of the included models.
Altogether these results show that the four classifiers included in LEAPH can identify potential pathogenicity proteins with different characteristics and confer good performance on the test dataset.

LEAPH outperformed other methods with regard to the prediction of effectors in phytoplasmas
To assess the performances of LEAPH compared with stateof-the-art tools for effector prediction, we compared LEAPH application to the training dataset with five classifiers available for effector prediction, namely SignalP4.1 ( 53 ) in combination, or not, with TMHMM2.0 ( 54 ) which we consider here the conventional method to predict putative effectors in phytoplasmas; the newly implemented PhyEffector ( 46 ) tailored for phytoplasma effector prediction; EffectorO ( 15 ); Ef-fectorP3.0( 18 ); and the three models for Deepredeff ( 19 ) that are built to predict putative effectors in other species ( Supplementary Table S7 ; Figure 3 ).LEAPH outperformed all the other methods by considering the four metrics used in this study, achieving a recall of 98.4%, thus predicting only three false negatives.In this context, the minimization of false negatives is crucial, to minimize the neglecting of putative true effectors.Importantly, the tool LEAPH also achieved good precision (95.3%), thus implying a low number of false positives as well (nine in total).Although experimental validation is central to validate putative candidate effectors predicted by computational methods, the low number of both false negatives and false positives increases the confidence in the prediction and will permit a more successful experimental validation.
Application of EffectorP3.0,EffectorO and Deepredefffungi showed quite good recall but very low values for the other metrics.This is due to the high number of false positives identified by those methods that lower the accuracy (ranging from 38.8% to 42.4%) and the precision (between 36.3% and 39.6%).On the other hand, the use of EffectorP3.0 and Deepredeff-fungi do not identify many false negatives, with a recall of 98.9% and 91.8%, respectively.Lower performances are obtained for EffectorO with a recall of 78.3%.The tools Deepredeff-oomycetes and Deepredeff-bacteria showed very poor performance.
The three combinations of SignalP4.1 with TMHMM2.0,namely SignalP4.1,SignalP4.1 + TMHMM2.0 and SignalP4.1 / TMHMM2.0 (in Supplementary Table S7 ) applied to our data achieved comparable performances mainly in terms of accuracy concerning LEAPH.However, their recalls are lower than the value achieved by our tool, suggesting that these methods might miss potential bona fide effector candidates.Since SignalP4.1 and TMHMM2.0 by definition can identify only putative classically secreted effectors, we hypothesize that these false negatives can be potential nonclassically secreted pathogenicity proteins that are captured by LEAPH thanks to the four classifiers.As expected, PhyEffector application showed good performances on the training set, reaching 92.7% accuracy, 91.2% precision, 89.7% recall and an overall F 1 score of 90.4% ( Supplementary Table S7 ; Figure 3 ).Surprisingly, accuracy and F 1 score were found to be slightly lower than those achieved by SignalP4.1 + TMHMM2.0.
Altogether these results highlight the importance of developing a more flexible and informative method tailored for phytoplasmas to identify bona fide candidate effectors and rule out EffectorP3.0, EffectorO, Deepredeff and PhyEffector from further comparisons.

LEAPH predicts classical, classical-like and non-classical secreted putative pathogenicity proteins from 13 phytoplasma proteomes
To test the capability of LEAPH to identify potential novel effectors, we selected 13 phytoplasma proteomes from 10 ' Ca.Phytoplasma' species with different characteristics in terms of the number of proteins, 16S group, type of symptoms and number of hosts ( Supplementary Table S8 ; Figure 4 ).On average, LEAPH predicts as putative pathogenicity proteins ∼30 ± 2.7% of the proteomes except for three strains: CaPhoeni-cium_SA213, CaPhoenicium_ChiP and CaPoryzae_NGS-S10, for which we have found 22.63, 27.65 and 28.32%, respectively (Figure 4 A).
To characterize the properties of these putative pathogenicity proteins, we compared them with the prediction of SignalP4.1 + TMHMM2.0 and SecretomeP2.0( 68 ).Sig-nalP4.1 + TMHMM2.0 is used to identify proteins having a signal peptide, usually present in classically secreted proteins, whereas SecretomeP2.0predicts non-classically secreted proteins.Importantly, both methods predict whether the protein is secreted or not without information on its pathogenicity.LEAPH predicts between 40% and 61% of putative effector proteins that were not identified by the other two methods (Figure 4 B).Between 33% and 58% of LEAPH putative effectors also show the signal peptide and the mTMR identified by TMHMM2.0,thus implying that they are classically secreted.These results pinpoint that LEAPH, despite the fact that it was trained on a dataset composed mainly of classically secreted effector proteins, can capture characteristics to identify potential effectors beyond the classically secreted ones.On the other hand, very few effectors predicted by LEAPH were also predicted by SecretomeP2.0.This was expected because the model used to run SecretomeP2.0was suitable for Gram-negative bacteria, but no model was available for Gram-positive bacteria.According to Gao et al .( 75), SecretomeP2.0underestimates non-classically secreted proteins when using the Gram-negative bacteria model for Gram-positive bacteria.Interestingly, between 77% and 89% of putative effectors predicted exclusively by LEAPH contain only the mTMR (without the signal peptide), supporting the hypothesis that phytoplasma effectors might have a peculiar signal peptide located in the so-called sequence variable mosaic (SVM) regions of some proteins ( 76 ).

Description of the effector protein landscape predicted by LEAPH
To further characterize the putative effectors predicted using LEAPH, we performed a PCA on protein features (see the Materials and methods).In Figure 5 , we observe three distinct groups.To understand whether these groups could represent some property of protein sequences, we colored them according to the type of secretion.We defined as 'classically' secreted, those proteins in which both a signal peptide and mTMR are predicted; 'classically like' secreted, those having only the prediction for the mTMR; and 'non-classically' secreted, those proteins in which neither signal peptide nor mTMRs are predicted (Figure 5 A).The three groups identified with the PCA respectively follow the three secretion modalities.Thus, we checked where the validated effectors are located in the putative pathogenicity protein landscape identified with the PCA (Figure 5 B).The 11 classically secreted validated effectors (two SAP11 in two different strains, SAP54, SAP05, PHYL1, TENGU, SWP1, SWP11, SWP12, SWP16 and SWP21) were localized in the cluster of classically secreted as expected.Surprisingly, among the four non-classically secreted validated effectors, only two were found in the corresponding cluster, namely ncSecP3 and ncSecP12.Unexpectedly, by inspecting the protein sequences, we found that nc-SecP14 contains the signal peptide and the mTMR, and nc-SecP22 only the mTMR, thus confirming their localization on the PCA.
Then, we mapped other properties on the three PCA groups.Interestingly, inspecting the second principal component, we noticed that the putative pathogenicity proteins found using LEAPH are stratified by the probability for the N-terminus to have a cytoplasmic location, as shown in Figure 5 C. When coloring by species, we observed that there are no specific clusters, as expected, meaning that pathogenicity protein properties are not species specific ( Supplementary Figure S4 A).In Supplementary Figure S4 B, we can observe that proteins are stratified by sequence length.Surprisingly we observe that few proteins from the classically secreted and non-classically secreted groups have a size larger than expected.Thus, we mapped these proteins on the sequence length distribution of the respective proteomes to see whether these sizes could be potential outliers or not ( Supplementary Figure S4 C).We observed that very few putative effectors, mainly predicted as classically secreted, map on the tails of these distributions.
To investigate other possible peculiar properties of LEAPHpredicted putative pathogenicity proteins, we used Inter-ProScan5 ( 71 ) to predict protein domains in the three classes of secretion.We considered only domains with an occurrence of at least 1% for at least one of the secretion classes, for a total of 37 protein domains.As shown in Figure 5 D, classically secreted putative effectors mainly possess domains characteristic of both SVM proteins, AAA+ ATPases and FtsH proteins.SVM proteins were reported for the presence of a modified secretion signal ( 76 ).Recently, it has been demonstrated that the AAA+ and Ftsh domains have a role in pathogenicity for ' Ca.P. mali' ( 81 ,82 ).Classical-like secreted candidate pathogenicity proteins, on the other hand, are mostly characterized by periplasmic-binding proteins or ABC transporters and only slightly enriched in AAA+ and Ftsh domains.Although ABC transporters are found ubiquitously in eukaryotes and prokaryotes, they play a crucial role in pathogenesis and virulence in pathogen bacteria ( 83 ).There are different types of ABC transporters [10 according to Zeng and Charkowski ( 84 )] and this might be linked to the presence of different ABC-related domains characterizing the nonclassically secreted group of predicted pathogenicity proteins.In this class, ABC transporters are less frequent concerning the P-ATPases and especially the IDRs.P-ATPases, particularly copper exporter P-type ATPases, are reported to play a major role in the virulence of diverse pathogenic bacteria even if the underlying mechanisms remain partially understood (85)(86)(87)(88)(89).These findings suggest that putative effectors are distinguished not only by the secretion mode, but also by other characteristics related to virulence properties embedded in the sequence composition.

SOM clustering allowed us to build a pathogenicity protein reference map for phytoplasmas
To study whether other properties are characteristic of subclasses of pathogenicity proteins beyond the type of secretion, we performed a clustering analysis on the LEAPH-predicted proteins using the SOM model ( 72 ).This analysis allowed us to create a 2D map composed of a 10 × 10 lattice where each hexagonal cell is characterized by a particular combination of features and closer cells have more similar properties than cells further away.Thus, the 2093 putative pathogenicity proteins, including the 15 validated effectors, identified with LEAPH across 13 phytoplasma proteomes are distributed into the map and associated with a specific hexagonal cell according to their sequence properties.Consistent with PCA, the two main properties that stratify the hexagons in the lattice are the signal peptide and the mTMR on the x -axis, and the probability for the N-terminus to have a cytoplasmic location on the y -axis (Figure 6 A).Therefore, we can visualize on the map the hexagons that correspond to the three groups found by the PCA: 49 hexagons for classically secreted, 31 for classically like and seven for non-classically secreted putative pathogenicity predicted proteins.Furthermore, eight hexagons that contain proteins from both groups of classically like and non-classically secreted and five empty hexagons can be identified on the map.6 A).As expected the two homologous phyllogen proteins (PHYL1 and SAP54) are in the same hexagon (h38).Similarly, SWP21 and TENGU are in the same hexagon (h77).The SAP11-like proteins do not fall in the same hexagon but in hexagons next to each other, suggesting more variability in the sequence properties of proteins in this class depending on the phytoplasma species.Intriguingly, SWP1 and SWP11, which have different roles in disease development ( 77 ), are in the same hexagon (h18).Interestingly, both interact with the TEOSINTE BRANCHED 1 / CYCLOIDEA / PROLIFERATING CELL FA CT OR 1 and 2 (TCP) transcription factors, thus suggesting shared sequence properties to enhance the interaction with similar partners to investigate further ( 90 ).The two non-classically secreted validated effectors (ncSecP12 and ncSecP3) are mapped in the same hexagon (h35), also suggesting very similar sequence properties.Importantly, LEAPH predicted new putative effectors which are mapped in the same hexagons as the biologically validated effectors offering novel potential effector candidates with similar sequence characteristics to be experimentally validated.

EffectorComb: a Shiny app to inspect the map of the pathogenicity proteins
To further inspect other properties and characteristics of the sequences contained in the pathogenicity protein map, we have developed a Shiny application, called EffectorComb, enabling us to show which features are enriched in each hexagon ( Supplementary Figure S5 ).For instance, we notice that the enrichment of phosphorylation sites is mainly present in hexagons collecting togethger classically secreted putative pathogenicity proteins.Specifically, hexagon h70 is characterized by a high presence of the domain cAMP-and cGMPdependent protein kinase phosphorylation site.Similarly, the Protein kinase C phosphorylation site and Casein kinase II phosphorylation site were found predominantly in h90.Tyrosine kinase phosphorylation sites 1 and 2 were also found enriched in h25, h99, h100 and h48.Protein phosphorylation is a widely used mechanism in bacteria to adapt to changes in their environment (where the conditions can alter rapidly), but it is also used for intercellular communication.The Nglycosylation site, specific to the consensus sequence Asn-Xaa-Ser / Thr, was found enriched in both classically like (h74) and classically secreted (h89) candidate pathogenicity proteins.Bacteria have evolved diverse glycosylation systems for pathogenesis.Bacterial glycosylation not only allows adhesion to the host cell but also functions to modulate crucial host cellular processes ( 91 ).Notably, glycosylation has been reported as an important process in the interaction between the 'flavescence dorée' phytoplasma and its insect vector ( 92 ), but also of phytoplasma strains and host plants ( 93 ,94 ).Interestingly the hexagon h74 is also characterized by the presence of the N -myristoylation site.N -myristoylation is an irreversible protein lipidation; in plants it is recognized as a major modification and concerns nearly 2% of all plant proteins ( 95 ,96 ).Attachment of a myristoyl group increases specific proteinprotein interactions leading to the subcellular localization of myristoylated proteins with their signaling partners ( 97 ).Intriguingly, Medina-Puche et al .( 98) identified novel mechanisms involving protein relocation due to the presence of the N -myristoylation site from the plasma membrane to chloroplasts that are utilized in plants for defense responses against pathogens.Furthermore, amidation sites were found mainly in the hexagons h10 and h40.C-terminal amidation reduces the overall charge of a peptide; therefore, its overall solubility might decrease.Modifications at the C-terminus have already been shown to have consequences on the pathogenicity of phytoplasmas; for instance, deletions at the C-terminus of the effector SWP1 impaired the induction of witches' broom ( 99 ).The enrichment of this domain might suggest that changes in protein solubility are important for phytoplasma effectors and might reflect distinct functions also depending on the different hosts (insect / plant).The leucine zipper domain is found only in the classically secreted putative pathogenicity proteins lying in h19.This pattern is present in many gene regulatory proteins, making this cluster very interesting for further investigation ( 100 ).Finally, the 'RGD' tripeptide that has been shown to play a role in cell adhesion ( 101) is abundant in both classically (h82 and h98) and non-classically (h1, h17 and h25) predicted pathogenicity proteins.
Overall, the SOM map allowed us to identify distinct classes of putative pathogenicity proteins with particular properties probably linked to their function and type of interaction with the plant and / or insect.

Conclusions
The few known biologically validated effectors in phytoplasmas are implicated in a huge spectrum of different functions, ranging from the regulation of plant morphogenesis to attracting insect vectors.They possess a variety of strategies to manipulate the host plants.Therefore, it is unlikely for phytoplasma effectors to be described by a few common characteristics.However, the most widely used method to select pathogenicity candidates for phytoplasmas is to screen protein sequences and look only for the presence of the signal peptide.Recently, a novel tool was introduced, PhyEffector, consisting of a pipeline of four different tools aimed at the identification of signal peptides and TM domains, and inspecting sequence similarity ( 46 ).Despite the promising performances of this tool, the results still lack a wider comprehension of effector physicochemical characteristics, a system by which to prioritize and evaluate the accuracy of the predictions and tools to help speed up the experimental validation process.Leveraging four machine learning classifiers coupled with a novel voting score for assessing the prediction score, we have developed LEAPH specifically conceived for phytoplasma effector candidate prediction.We have shown that using LEAPH, it is possible to predict putative effector candidates with high reliability and outperform existing tools not adapted for phy-toplasmas.A major advantage of LEAPH consists of its ability to predict candidate pathogenicity proteins independently by the presence of the signal peptide and with many different sequence characteristics.Furthermore, as our knowledge of effector proteins increases, the early stage of LEAPH can be refined by constituting a novel training dataset to re-train the models to improve their performances.
The application of LEAPH on 13 phytoplasma proteomes led to the identification of 2093 putative candidates showing sequence characteristics linked to virulence and pathogenicity.To investigate the properties of those proteins, we used the SOM algorithm and obtained the first pathogenicity protein map for this pathogen.To provide an easy exploration of the pathogenicity protein map, we have developed an easyto-use Shiny application called EffectorComb.To the best of our knowledge, this is the first time that a comprehensive characterization of putative pathogenicity proteins has been provided for phytoplasmas.The use of this map goes beyond the identification of groups of putative pathogenicity proteins having similar properties, but can be used as a reference map to project new predicted pathogenicity proteins from other proteomes in a predictive perspective.Therefore, when using this map as a reference to analyze a new sample, the positioning of each predicted pathogenicity protein of a new proteome on the 2D plane can be used to assess its pathogenicity properties and overall sequence characteristics.The pathogenicity protein map can also be used to accelerate the choice of novel candidates for biological validation, for instance by focusing on proteins in the same group of known validated effectors or on groups showing particular sequence properties of interest.
Overall, an ensemble of four learning model never used before in phytoplasma effector prediction, the introduction of a prediction prioritization score based on model agreement, the delineation of the phytoplasma effector reference map coming from a wide range of phytoplasma proteomes and the possibility to easily explore it, interact with it and retrieve protein sequences from it, make LEAPH and Effector-Comb novel and valuable tools to improve our understanding of effectors in plant-phytoplasma interactions.Finding novel experimentally validated effectors can be used to set up novel methods to improve plant resistance to these devastating bacteria.

Data availability
The datasets supporting the conclusions of this article are included within the article and its additional files.

Figure 1 .
Figure 1.Training dataset composition and LEAPH w orkflo w. ( A ) Training dataset composition of the positive and negative set proportion (in the center) and phytoplasma species for the positive dataset (on the left) and negative dataset (on the right).Refer to Supplementary Tables S1.1 and S1.2 for details.( B ) LEAPH w orkflo w.T he dark green bo x highlights the training process while the light green bo x represents the application process of LEAPH; the dashed box depicts the particular use case in which a feature table is already available, thus ready to be used as input for the pre-trained LEAPH model.

Figure 2 .
Figure 2. LEAPH and SHAP results on the training dataset.( A ) Venn diagram representing the agreement of prediction on the training set from the four models included in LEAPH, namely random forest model (purple ellipse), XGBoost (dark blue ellipse), Gaussian naive Bayes model (aquamarine ellipse) and multinomial naive Bayes model (yellow ellipse).( B ) The first 10 features in terms of SHAP value importance for each model included in LEAPH.Refer to Supplementary Figure S3 for the complete representation of SHAP importance.

Figure 3 .
Figure 3. Performances of LEAPH and state-of-the-art tools for effector prediction.Performances of each tested state-of-the-art tool compared with LEAPH performance.For each tool, the percentage of accuracy, F-measure, precision and recall after the application on the training set is reported.

Figure 4 .
Figure 4. P utativ e pathogenicity proteins predicted b y LEAPH from 13 ph ytoplasma proteomes.( A ) Percent age of put ative pathogenicit y proteins predicted by LEAPH (orange bars) compared with the total percentage of proteins (gray bars) for the corresponding phytoplasma proteome.( B ) Comparison of the proportion of proteins predicted only by LEAH (green bars), predicted by the SignalP + TMHMM method and LEAPH (yellow bars) or by SectetomeP2.0 and LEAPH (magenta bars).

Figure 5 .
Figure 5. Description of the effector protein landscape predicted by LEAPH.( A ) PCA of predicted pathogenicity proteins by LEAPH from 13 phytoplasma proteomes colored by secretion methodology; group1 (dark green points) predicted by SignalP4.1 to bear a signal peptide (classically secreted), group2 (light y ello w) contains an mTMR in the first 60 amino acids of the sequence from TMHMM2.0 and no prediction of signal peptide b y SignalP4.1 (classically like secreted) and group3 (light green points) do not have the signal peptide and the mTMR (non-classically secreted).( B ) The same as (A) with the position of the 15 experimentally validated effectors from the literature (dark red dots).( C ) Same as (A) colored by the probability that the N-terminal region of the protein is in the inner part of the membrane.( D ) The abundance of each protein domain predicted by InterProScan in each PCA group.

Figur e 6 .
Figur e 6. P athogenicity protein reference map f or ph ytoplasmas.( A ) LEAPH predicted proteins projected on an SOM and colored b y their secretion group: classically secreted (dark green points and dark green he xagons), classically lik e secreted (light y ello w points and green hexagons), non-classically secreted (light green points and light green hexagons) and mixed secretion modality (olive green hexagons).Dark red contours indicate the hexagons containing the validated effectors from the literature.Empty gray hexagons represent profiles of features that are not represented by the predicted proteins.( B ) The same as (A) showing the abundance of features for each hexagon.